load("~/Documents/Nonparametric Statisics/Project/clean data/functional/df_hour.RData")

First, we start with outlier detection in the functional setting.

fd_day <- fData(1:24,as_tibble(df_hour[,2:25]))
plot(fd_day, lwd = 3, xlab = "hour", ylab = "number of crashes", main = "hourly crashes on each day")

Functional boxplot:

hour_fbagplot <- fbplot(fd_day, main = "Magnitude outliers, hourly data")

The default F value is:

hour_fbagplot$Fvalue
[1] 1.5

No outliers are found:

df_hour[hour_fbagplot$ID_outliers,1:2]

Observations with 0 crashes create a numerical problem, so we add 1 to every count.

df_hour2 <- df_hour + 1
fd_day2 <- fData(1:24, as_tibble(df_hour2[, 2:25]))
hour_fbagplot2 <- fbplot(fd_day2, main = "Magnitude outliers, hourly data",
                         adjust = list(N_trials = 20, trial_size = fd_day2$N,
                                       VERBOSE = TRUE))
 * * * Iteration  1  /  20 
 * * * * beginning optimization
Error in stats::uniroot(cost_functional, interval = c(F_min, F_max), tol = tol,  : 
  f() values at end points not of opposite sign

The adjustment is still not working: `uniroot` cannot bracket a root because the cost functional has the same sign at both ends of the search interval.
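For context, the error can be reproduced in isolation: `stats::uniroot` fails whenever the target function does not change sign over the search interval, which is what happens to the cost functional here. A minimal stand-alone illustration (the quadratic below is just a placeholder for the cost functional):

```r
# stats::uniroot needs f(lower) and f(upper) to have opposite signs;
# a strictly positive function triggers the same error seen above.
f <- function(x) x^2 + 1
msg <- tryCatch(uniroot(f, interval = c(-1, 1)),
                error = function(e) conditionMessage(e))
msg  # "f() values at end points not of opposite sign"
```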

Choosing a smaller F value instead:

hour_fbagplot2 <- fbplot(fd_day, main = "Magnitude outliers, hourly data", Fvalue = 0.9)

The chosen F value is:

hour_fbagplot2$Fvalue
[1] 0.9

The outlying days are:

df_hour[hour_fbagplot2$ID_outliers,1]

New Year's Eve, plus two unusual days: 2005-07-31 and 2009-02-11.

Outliergram:

invisible(out_hour <- outliergram(fd_day, adjust = FALSE, lwd = 3, display = FALSE))

The detected outliers are:

df_hour[out_hour$ID_outliers,1]

The outliergram's plot of the original functions is not working, so we keep `display = FALSE`.
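The outliergram ranks curves by their modified band depth (MBD); as background, here is a simplified stand-alone sketch of MBD with bands spanned by J = 2 curves (illustrative only, not the roahd implementation):

```r
# Modified band depth (J = 2): for each curve (a row of Y), average over
# all pairs of sample curves the fraction of time points at which the
# curve lies inside the band spanned by the pair.
mbd <- function(Y) {
  n <- nrow(Y)
  pairs <- combn(n, 2)
  sapply(seq_len(n), function(i) {
    mean(apply(pairs, 2, function(jk) {
      lo <- pmin(Y[jk[1], ], Y[jk[2], ])
      hi <- pmax(Y[jk[1], ], Y[jk[2], ])
      mean(Y[i, ] >= lo & Y[i, ] <= hi)
    }))
  })
}

Y <- rbind(c(1, 1, 1),  # lowest curve
           c(2, 2, 2),  # central curve: highest depth
           c(3, 3, 3))  # highest curve
mbd(Y)  # the middle curve gets depth 1, the extremes 2/3
```

Shape outliers have low MBD relative to how often they lie between other curves, which is what the outliergram visualizes.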

Adjusting the F value:

out_hour <- outliergram(fd_day,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_day$N,
                                                 VERBOSE = TRUE ),display = FALSE)
 * * * Iteration  1  /  20 
 * * * * beginning optimization
 [... the same two lines repeat for iterations 2 through 20 ...]
out_hour$Fvalue
[1] 2.055792

With the adjusted F value, nothing changed: the same outliers are detected.

df_hour[out_hour$ID_outliers,1:2]

Plotting outliers and non-outliers side by side:

par(mfrow=c(1,2))
plot(fd_day[out_hour$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_day[-out_hour$ID_outliers,],lwd = 1,main = "non outliers",col = 3)

There are two types of outliers: days with very few crashes and days with very many crashes.

Let's do some clustering:

hours <- 1:24
n <- fd_day$N
x <- t(matrix(rep(hours,n),24,n))
y <- as.matrix(df_hour[,2:25])
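As a side note, the abscissa grid built above can be written more directly with `byrow = TRUE`; a quick check with a hypothetical small n (the analysis itself uses n = fd_day$N):

```r
hours <- 1:24
n <- 5  # illustrative size only
x1 <- t(matrix(rep(hours, n), 24, n))                   # construction used above
x2 <- matrix(hours, nrow = n, ncol = 24, byrow = TRUE)  # equivalent, clearer
identical(x1, x2)  # TRUE
```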
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 6574
 - Number of dimensions: 1
 - Number of points: 24

Information about cluster initialization:
 - Number of clusters: 3
 - Initial seeds for cluster centers:          3405         5667         1171

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 4049
   * Size of cluster #1: 1239
   * Size of cluster #2: 1286
 - Iteration #2
   * Size of cluster #0: 3456
   * Size of cluster #1: 1233
   * Size of cluster #2: 1885
 - Iteration #3
   * Size of cluster #0: 3148
   * Size of cluster #1: 1443
   * Size of cluster #2: 1983
 - Iteration #4
   * Size of cluster #0: 2937
   * Size of cluster #1: 1589
   * Size of cluster #2: 2048
 - Iteration #5
   * Size of cluster #0: 2767
   * Size of cluster #1: 1724
   * Size of cluster #2: 2083
 - Iteration #6
   * Size of cluster #0: 2637
   * Size of cluster #1: 1832
   * Size of cluster #2: 2105
 - Iteration #7
   * Size of cluster #0: 2520
   * Size of cluster #1: 1935
   * Size of cluster #2: 2119
 - Iteration #8
   * Size of cluster #0: 2436
   * Size of cluster #1: 2007
   * Size of cluster #2: 2131
 - Iteration #9
   * Size of cluster #0: 2373
   * Size of cluster #1: 2067
   * Size of cluster #2: 2134

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
 37.655   0.394  38.089 
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "l2",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 6574
 - Number of dimensions: 1
 - Number of points: 24

Information about cluster initialization:
 - Number of clusters: 3
 - Initial seeds for cluster centers:          1957          365         4321

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: l2
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 3398
   * Size of cluster #1: 2512
   * Size of cluster #2: 664
 - Iteration #2
   * Size of cluster #0: 3510
   * Size of cluster #1: 2379
   * Size of cluster #2: 685
 - Iteration #3
   * Size of cluster #0: 3567
   * Size of cluster #1: 2358
   * Size of cluster #2: 649

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
 15.469   0.213  15.699 
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

Pearson is the preferred metric, since it accounts for the shape of the curves rather than just their magnitude.
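A toy example of the difference (stand-alone, using plain `cor` as a stand-in for the package's Pearson dissimilarity): two curves with identical shape but different scale are indistinguishable for a correlation-based metric, while the L2 distance separates them.

```r
t_grid <- 1:24
f <- sin(2 * pi * t_grid / 24)  # baseline daily profile
g <- 3 * f + 10                 # same shape, different magnitude

pearson_diss <- 1 - cor(f, g)   # ~0: same shape
l2_dist <- sqrt(sum((f - g)^2)) # large: different magnitude
c(pearson_diss, l2_dist)
```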

We need to choose k:

n_sub <- 50
sub_id <- sample(1:n,n_sub,replace = FALSE)
x_sub <- x[sub_id,]
y_sub <- y[sub_id,]

system.time(invisible(comparison_kmeans <- compare_caps(
  x_sub,
  y_sub,
  n_clusters = 1:4,
  metric = "pearson",
  clustering_method = "kmeans",
  warping_class = "none",
  centroid_type = "mean",
  cluster_on_phase = FALSE
    )
  )
)
plot(comparison_kmeans, validation_criterion = "wss", what = "mean",lwd = 3)

plot(comparison_kmeans, validation_criterion = "wss", what = "distribution")

plot(comparison_kmeans, validation_criterion = "silhouette", what = "mean")

plot(comparison_kmeans, validation_criterion = "silhouette", what = "distribution")

It is not clear-cut, but k = 2 seems to be the best choice.
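The reasoning above is the usual elbow heuristic; a self-contained sketch of it with plain multivariate `stats::kmeans` on simulated data (not the functional pipeline):

```r
set.seed(1)
# two well-separated Gaussian groups in the plane
Y <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
# total within-cluster sum of squares as a function of k
wss <- sapply(1:4, function(k) kmeans(Y, centers = k, nstart = 10)$tot.withinss)
wss  # the big drop is from k = 1 to k = 2, then the curve flattens
```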

k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

This gives a clear distinction between working days and holidays.
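This interpretation could be checked by cross-tabulating cluster labels against the day of the week. A self-contained sketch with synthetic dates and labels, just to show the mechanics (the real check would use the date column of df_hour and fdakma0der$memberships, names assumed from the code above):

```r
dates <- seq(as.Date("2005-01-01"), by = "day", length.out = 14)
# pretend the clustering assigned weekends to cluster 2 (%u: Mon = 1 ... Sun = 7)
labels <- ifelse(format(dates, "%u") %in% c("6", "7"), 2L, 1L)
table(weekend = format(dates, "%u") %in% c("6", "7"), cluster = labels)
```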

Trying k = 4:

k <- 4
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
Information about the data set:
 - Number of observations: 6574
 - Number of dimensions: 1
 - Number of points: 24

Information about cluster initialization:
 - Number of clusters: 4
 - Initial seeds for cluster centers:          4874         3516         2556         4729

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 20
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 2274
   * Size of cluster #1: 750
   * Size of cluster #2: 34
   * Size of cluster #3: 3516
 - Iteration #2
   * Size of cluster #0: 1453
   * Size of cluster #1: 1277
   * Size of cluster #2: 232
   * Size of cluster #3: 3612
 - Iteration #3
   * Size of cluster #0: 1299
   * Size of cluster #1: 1397
   * Size of cluster #2: 467
   * Size of cluster #3: 3411
 - Iteration #4
   * Size of cluster #0: 1408
   * Size of cluster #1: 1402
   * Size of cluster #2: 617
   * Size of cluster #3: 3147
 - Iteration #5
   * Size of cluster #0: 1566
   * Size of cluster #1: 1362
   * Size of cluster #2: 721
   * Size of cluster #3: 2925
 - Iteration #6
   * Size of cluster #0: 1704
   * Size of cluster #1: 1324
   * Size of cluster #2: 792
   * Size of cluster #3: 2754
 - Iteration #7
   * Size of cluster #0: 1824
   * Size of cluster #1: 1283
   * Size of cluster #2: 849
   * Size of cluster #3: 2618
 - Iteration #8
   * Size of cluster #0: 1921
   * Size of cluster #1: 1270
   * Size of cluster #2: 881
   * Size of cluster #3: 2502
 - Iteration #9
   * Size of cluster #0: 1987
   * Size of cluster #1: 1269
   * Size of cluster #2: 901
   * Size of cluster #3: 2417
 - Iteration #10
   * Size of cluster #0: 2056
   * Size of cluster #1: 1257
   * Size of cluster #2: 922
   * Size of cluster #3: 2339
 - Iteration #11
   * Size of cluster #0: 2104
   * Size of cluster #1: 1255
   * Size of cluster #2: 934
   * Size of cluster #3: 2281

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
 55.476   0.725  56.717 
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

It is clear that the additional clusters mostly capture magnitude differences in the first peaks.

Trying k = 10:

k <- 10
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
Information about the data set:
 - Number of observations: 6574
 - Number of dimensions: 1
 - Number of points: 24

Information about cluster initialization:
 - Number of clusters: 10
 - Initial seeds for cluster centers:          3165         5574         5201         4890          680          611         5666          351         1358         2385

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 20
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 439
   * Size of cluster #1: 11
   * Size of cluster #2: 278
   * Size of cluster #3: 786
   * Size of cluster #4: 660
   * Size of cluster #5: 722
   * Size of cluster #6: 351
   * Size of cluster #7: 403
   * Size of cluster #8: 974
   * Size of cluster #9: 1950
 [... iterations 2 through 43: the cluster sizes shift gradually toward the final configuration ...]
 - Iteration #44
   * Size of cluster #0: 719
   * Size of cluster #1: 661
   * Size of cluster #2: 797
   * Size of cluster #3: 772
   * Size of cluster #4: 524
   * Size of cluster #5: 314
   * Size of cluster #6: 685
   * Size of cluster #7: 553
   * Size of cluster #8: 803
   * Size of cluster #9: 746

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
407.936   2.718 410.909 
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

Not much more information here; we stick with k = 2.

k <- 10
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "l2",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
Information about the data set:
 - Number of observations: 6574
 - Number of dimensions: 1
 - Number of points: 24

Information about cluster initialization:
 - Number of clusters: 10
 - Initial seeds for cluster centers:          1888         2794          294         5269         4206         2609         2990         4427         3292         1192

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: l2
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 20
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 179
   * Size of cluster #1: 760
   * Size of cluster #2: 284
   * Size of cluster #3: 460
   * Size of cluster #4: 316
   * Size of cluster #5: 2314
   * Size of cluster #6: 926
   * Size of cluster #7: 704
   * Size of cluster #8: 371
   * Size of cluster #9: 260
 - Iteration #2
   * Size of cluster #0: 336
   * Size of cluster #1: 848
   * Size of cluster #2: 629
   * Size of cluster #3: 419
   * Size of cluster #4: 402
   * Size of cluster #5: 1854
   * Size of cluster #6: 784
   * Size of cluster #7: 579
   * Size of cluster #8: 343
   * Size of cluster #9: 380
 - Iteration #3
   * Size of cluster #0: 474
   * Size of cluster #1: 793
   * Size of cluster #2: 865
   * Size of cluster #3: 343
   * Size of cluster #4: 393
   * Size of cluster #5: 1615
   * Size of cluster #6: 656
   * Size of cluster #7: 552
   * Size of cluster #8: 268
   * Size of cluster #9: 615
 - Iteration #4
   * Size of cluster #0: 558
   * Size of cluster #1: 773
   * Size of cluster #2: 1030
   * Size of cluster #3: 314
   * Size of cluster #4: 391
   * Size of cluster #5: 1441
   * Size of cluster #6: 610
   * Size of cluster #7: 521
   * Size of cluster #8: 192
   * Size of cluster #9: 744
 - Iteration #5
   * Size of cluster #0: 619
   * Size of cluster #1: 776
   * Size of cluster #2: 1122
   * Size of cluster #3: 289
   * Size of cluster #4: 416
   * Size of cluster #5: 1353
   * Size of cluster #6: 547
   * Size of cluster #7: 483
   * Size of cluster #8: 143
   * Size of cluster #9: 826
 - Iteration #6
   * Size of cluster #0: 666
   * Size of cluster #1: 798
   * Size of cluster #2: 1192
   * Size of cluster #3: 261
   * Size of cluster #4: 463
   * Size of cluster #5: 1273
   * Size of cluster #6: 464
   * Size of cluster #7: 462
   * Size of cluster #8: 112
   * Size of cluster #9: 883
 - Iteration #7
   * Size of cluster #0: 692
   * Size of cluster #1: 830
   * Size of cluster #2: 1244
   * Size of cluster #3: 217
   * Size of cluster #4: 521
   * Size of cluster #5: 1228
   * Size of cluster #6: 410
   * Size of cluster #7: 454
   * Size of cluster #8: 85
   * Size of cluster #9: 893
 - Iteration #8
   * Size of cluster #0: 730
   * Size of cluster #1: 856
   * Size of cluster #2: 1280
   * Size of cluster #3: 203
   * Size of cluster #4: 548
   * Size of cluster #5: 1206
   * Size of cluster #6: 347
   * Size of cluster #7: 430
   * Size of cluster #8: 74
   * Size of cluster #9: 900

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
121.775   1.155 122.930 
autoplot(fdakma0der)

matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'cluster centroid curves', xlab = 'hour', ylab = 'crashes', col = 1:k, lwd = 3)

Not much more information with the L2 metric either; we stick with k = 2.

---
title: "Nonparametric Analysis of UK Road Accidents"
subtitle: "Functional clustering of daily data"
author:
    - "Valeria Iapaolo"
    - "Oswaldo Morales"
    - "Riccardo Morandi"
    - "Abylai Orynbassar"
    
output: html_notebook
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    #dev = c('pdf'),
    fig.align = 'center',
    #fig.path = 'output/',
    fig.height = 6,
    fig.width = 12
)
```

```{r libraries inclusions, include=FALSE}
library(tidyverse)
library(roahd)
#library(fda)
library(fdacluster)
```

```{r}
load("~/Documents/Nonparametric Statisics/Project/clean data/functional/df_hour.RData")
```

first we start with outlier detection in the functional case

```{r}
fd_day <- fData(1:24,as_tibble(df_hour[,2:25]))
```

```{r}
plot(fd_day,lwd = 3,xlab = "day",ylab = "dayly number of crashes",main = "hourly crashes in each week")
```

functional bagplot:

```{r}
hour_fbagplot <- fbplot(fd_day, main = "Magnitude outliers, hourly data")
```

The default F value is:

```{r}
hour_fbagplot$Fvalue
```


No outliers are found:

```{r}
df_hour[hour_fbagplot$ID_outliers,1:2]
```


Observations with 0 crashes create a numerical problem, so we add 1 to everything:

```{r}
df_hour2 <- df_hour
df_hour2[, 2:25] <- df_hour2[, 2:25] + 1  # shift the counts only, leave the date column
fd_day2 <- fData(1:24, as_tibble(df_hour2[, 2:25]))
```

```{r, error=TRUE}
hour_fbagplot2 <- fbplot(fd_day2, main = "Magnitude outliers hourly data",
                         adjust = list(N_trials = 20, trial_size = fd_day2$N,
                                       VERBOSE = TRUE))
```

This is not working: the adjustment procedure errors out in `stats::uniroot` while tuning F.
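The reported error, "f() values at end points not of opposite sign", means `stats::uniroot` was handed an interval on which the cost functional never crosses zero. A minimal base-R reproduction of the same failure mode:

```{r}
# uniroot() demands f(lower) and f(upper) with opposite signs; a function
# that never crosses zero on the interval triggers exactly this error.
f_bad <- function(x) x^2 + 1                # strictly positive on [-1, 1]
msg <- tryCatch(
  uniroot(f_bad, interval = c(-1, 1)),
  error = function(e) conditionMessage(e)   # "f() values at end points not of opposite sign"
)
msg
```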

Choosing a smaller F value instead:

```{r}
hour_fbagplot2 <- fbplot(fd_day, main = "Magnitude outliers hourly data", Fvalue = 0.9)
```

The chosen F value is:

```{r}
hour_fbagplot2$Fvalue
```

The outlying days are:

```{r}
df_hour[hour_fbagplot2$ID_outliers,1]
```

New Year's Eve and two unusual days: 2005-07-31 and 2009-02-11.


Outliergram:
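The outliergram targets shape outliers by plotting the modified epigraph index (MEI) against the modified band depth (MBD); curves falling far below the parabola relating the two are flagged. As a rough illustration of the depth involved (a sketch, not roahd's implementation), MBD with bands spanned by pairs of curves can be written as:

```{r}
# Modified band depth, J = 2: for each curve, the average over all pairs of
# curves of the fraction of time points where it lies inside the pointwise
# envelope (band) of the pair. Central curves get high depth.
mbd_sketch <- function(Y) {          # Y: n curves (rows) x p time points
  n <- nrow(Y)
  pairs <- combn(n, 2)
  sapply(1:n, function(i) {
    mean(apply(pairs, 2, function(pr) {
      lo <- pmin(Y[pr[1], ], Y[pr[2], ])
      hi <- pmax(Y[pr[1], ], Y[pr[2], ])
      mean(Y[i, ] >= lo & Y[i, ] <= hi)
    }))
  })
}

# The central flat curve is deeper than the extreme one:
Y <- rbind(rep(0, 10), rep(1, 10), rep(-1, 10), rep(5, 10))
mbd_sketch(Y)
```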

```{r}
invisible(out_hour <- outliergram(fd_day, adjust = FALSE, lwd = 3, display = FALSE))
```

The detected outliers are:

```{r}
df_hour[out_hour$ID_outliers,1]
```

The plot produced by the original call is not working.

Adjusting the F value:

```{r}
out_hour <- outliergram(fd_day, lwd = 3,
                        adjust = list(N_trials = 20, trial_size = 8 * fd_day$N,
                                      VERBOSE = TRUE),
                        display = FALSE)
```

```{r}
out_hour$Fvalue
```

Nothing changed; the same outliers are detected.

```{r}
df_hour[out_hour$ID_outliers,1:2]
```

Plotting in the old way:

```{r}
par(mfrow=c(1,2))
plot(fd_day[out_hour$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_day[-out_hour$ID_outliers,],lwd = 1,main = "non outliers",col = 3)
```

There are two types of outliers: days with very few crashes and days with very many crashes.

Let's do some clustering:


```{r}
hours <- 1:24
n <- fd_day$N
x <- t(matrix(rep(hours,n),24,n))
y <- as.matrix(df_hour[,2:25])
```

```{r}
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der)
```

```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

```{r}
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "l2",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der)
```


```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

Pearson is the preferred metric, since it also takes the shape of the curves into account.
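To see the difference: the `"pearson"` metric is correlation-based, so it is insensitive to the overall magnitude of a curve, whereas the L2 distance is dominated by it. A hedged base-R sketch (these helpers are illustrative stand-ins, not fdacluster's internal distances):

```{r}
# Correlation-based dissimilarity ignores amplitude scaling; L2 does not.
d_pearson <- function(a, b) 1 - cor(a, b)
d_l2      <- function(a, b) sqrt(sum((a - b)^2))

hrs   <- 1:24
shape <- sin(hrs / 24 * 2 * pi)   # a common daily shape
big   <- 10 * shape               # same shape, 10x the magnitude

d_pearson(shape, big)   # ~0: identical shape, so a Pearson metric groups them
d_l2(shape, big)        # large: L2 separates them on magnitude alone
```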

We need to choose k:

```{r}
n_sub <- 50
sub_id <- sample(1:n,n_sub,replace = FALSE)
x_sub <- x[sub_id,]
y_sub <- y[sub_id,]

system.time(invisible(comparison_kmeans <- compare_caps(
  x_sub,
  y_sub,
  n_clusters = 1:4,
  metric = "pearson",
  clustering_method = "kmeans",
  warping_class = "none",
  centroid_type = "mean",
  cluster_on_phase = FALSE
    )
  )
)
```

```{r}
plot(comparison_kmeans, validation_criterion = "wss", what = "mean",lwd = 3)
```

```{r}
plot(comparison_kmeans, validation_criterion = "wss", what = "distribution")
```

```{r}
plot(comparison_kmeans, validation_criterion = "silhouette", what = "mean")
```

```{r}
plot(comparison_kmeans, validation_criterion = "silhouette", what = "distribution")
```

It is not clear-cut, but k = 2 should be the best choice.
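The WSS plots are read as elbow plots: pick the k after which the within-cluster sum of squares stops dropping appreciably. A self-contained base-R sketch of the same criterion using `stats::kmeans` on synthetic daily curves with two true groups (the actual analysis of course uses the fdacluster results above):

```{r}
set.seed(1)
shape <- sin((1:24) / 24 * 2 * pi)
# 30 low-amplitude and 30 high-amplitude noisy copies of the same shape
Ysim <- rbind(t(replicate(30, 1 * shape + rnorm(24, sd = 0.2))),
              t(replicate(30, 4 * shape + rnorm(24, sd = 0.2))))

# Total within-cluster sum of squares for k = 1, ..., 4
wss <- sapply(1:4, function(k)
  kmeans(Ysim, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)   # large drop from k = 1 to k = 2, then nearly flat: elbow at 2
```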

```{r}
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
```

```{r}
autoplot(fdakma0der)
```

```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

This is a clear distinction between working days and holidays.
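One way to check this interpretation is to cross-tabulate cluster memberships against day of week; with the real objects this would use the date column of `df_hour` and `fdakma0der$memberships` (the field name is an assumption about the returned caps object). A self-contained sketch of the idiom on synthetic labels:

```{r}
# Synthetic stand-ins: a year of dates plus a fake 2-cluster assignment
# that puts weekends in cluster 2, mimicking a working-day/holiday split.
dates  <- seq(as.Date("2005-01-01"), by = "day", length.out = 365)
dow    <- as.POSIXlt(dates)$wday              # 0 = Sunday, ..., 6 = Saturday
member <- ifelse(dow %in% c(0, 6), 2L, 1L)

table(day_of_week = dow, cluster = member)
```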

Trying k = 4:

```{r}
k <- 4
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
```

```{r}
autoplot(fdakma0der)
```

```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

It is clear that the extra clusters just split on magnitude around the first peaks.

Trying k = 10:

```{r}
k <- 10
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
```

```{r}
autoplot(fdakma0der)
```

```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

Not much more information here; we stick to k = 2.

```{r}
k <- 10
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "l2",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 20L
)
)
```

```{r}
autoplot(fdakma0der)
```

```{r}
matplot(t(fdakma0der$center_curves[,1,]), type = 'l',
        main = 'clustered and aligned curves', xlab = 'hours', ylab = 'crashes', col = 1:k, lwd = 3)
```

Not much more information with the L2 metric either; we stick to k = 2.




